1 Background

This case-study is a subset of the data of the 6th study of the Clinical Proteomic Technology Assessment for Cancer (CPTAC). In this experiment, the authors spiked the Sigma Universal Protein Standard mixture 1 (UPS1) containing 48 different human proteins in a protein background of 60 ng/\(\mu\)L Saccharomyces cerevisiae strain BY4741. Two different spike-in concentrations were used: 6A (0.25 fmol UPS1 proteins/\(\mu\)L) and 6B (0.74 fmol UPS1 proteins/\(\mu\)L) [5]. We limited ourselves to the data of LTQ-Orbitrap W at site 56. The data were searched with MaxQuant version 1.5.2.8, and detailed search settings were described in Goeminne et al. (2016) [1]. Three replicates are available for each concentration.

2 Data

We first import the peptides.txt file. This is the file that contains the peptide-level intensities. For a MaxQuant search [6], the peptides.txt file can be found by default in the “path_to_raw_files/combined/txt/” folder of the MaxQuant output, with “path_to_raw_files” the folder where the raw files were saved. In this tutorial, we will use a MaxQuant peptides file that is a subset of the CPTAC study and is available in the msdata package. We will use the Features package to import the data.

We generate the object peptidesFile with the path to the peptides.txt file. With the grepEcols function we find the columns that contain the peptide intensities in the peptides.txt file.

library(tidyverse)
library(limma)
library(Features)
library(msqrob2)

peptidesFile <- msdata::quant(pattern = "cptac_a_b_peptides", full.names = TRUE)
ecols <- MSnbase::grepEcols(peptidesFile, "Intensity ", split = "\t")
pe <- readFeatures(table = peptidesFile, fnames = 1, ecol = ecols,
                   name = "peptideRaw", sep="\t")

We can extract the spike-in condition from the raw file name.

cond <- which(strsplit(colnames(pe)[[1]][1], split = "")[[1]] == "A") # find where condition is stored
colData(pe)$condition <- substr(colnames(pe), cond, cond) %>% unlist %>%  as.factor
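The same idea can be tried on a few made-up sample names. This is only a sketch: the names in `sample_names` are hypothetical and not taken from the data set, and the approach assumes that the first name contains exactly one letter "A", located where the condition is encoded.

```r
# Hypothetical sample names, for illustration only
sample_names <- c("Intensity.6A_7", "Intensity.6A_8",
                  "Intensity.6B_7", "Intensity.6B_8")

# Position of the letter "A" in the first name
pos <- which(strsplit(sample_names[1], split = "")[[1]] == "A")

# Take the character at that position in every name as the condition
condition <- as.factor(substr(sample_names, pos, pos))

print(pos)
print(table(condition))
```

If a name contained more than one "A", `pos` would have length greater than one and the position would have to be chosen explicitly.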

We calculate how many non-zero intensities we have per peptide. This will be useful for filtering.

rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)

Peptides with zero intensity are missing and should be represented by NA instead of 0.

pe <- zeroIsNA(pe,"peptideRaw")
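To make the counting and recoding steps concrete, here is a minimal base R sketch on a toy intensity matrix. The matrix `x` and all object names are made up for illustration. It mimics what the calls above do on a plain matrix and is not the implementation used by the package.

```r
# Toy intensity matrix: rows are peptides, columns are samples (made-up values)
x <- matrix(c(0, 10, 12,
              0,  0,  7,
              5,  6,  8),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("pepA", "pepB", "pepC"), c("s1", "s2", "s3")))

# Number of non-zero intensities per peptide
n_non_zero <- rowSums(x > 0)

# Recode zeros as missing values
x_na <- x
x_na[x_na == 0] <- NA

# Fraction of missing intensities
prop_missing <- mean(is.na(x_na))

# Peptides observed at least twice
x_kept <- x_na[n_non_zero >= 2, , drop = FALSE]

print(n_non_zero)
print(prop_missing)
print(rownames(x_kept))
```

Counting before recoding is convenient because a comparison such as `x > 0` returns NA for missing values once the zeros have been replaced.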

2.1 Data exploration

We can inspect the missingness in our data with the plotNA() function provided with MSnbase. 45% of all peptide intensities are missing, and for some peptides we do not measure a signal in any sample. The missingness is similar across samples. Note that we plot the peptide data, so the label “protein” in the plot refers to peptides.

MSnbase::plotNA(assay(pe[["peptideRaw"]]))

3 Preprocessing

We will log transform, normalize, filter and summarize the data.

3.1 Log transform the data

pe <- logTransform(pe, base = 2,i="peptideRaw",name="peptideLog")
limma::plotDensities(assay(pe[["peptideLog"]]))

3.2 Filtering

3.2.1 Handling overlapping protein groups

In our approach a peptide can map to multiple proteins, as long as none of these proteins is present in a smaller subgroup.

keep <- rowData(pe[["peptideLog"]])$Proteins %in%
  smallestUniqueGroups(rowData(pe[["peptideLog"]])$Proteins)
pe[["peptideLog"]] <- pe[["peptideLog"]][keep, ]
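The rule can be illustrated on a handful of made-up protein group strings. The function `smallest_groups_sketch` below is a hypothetical helper written for this example. It is not the smallestUniqueGroups implementation, and it assumes that the proteins within a group are separated by ";".

```r
# Hypothetical helper: keep a protein group only if none of its proteins
# occurs in an already retained, strictly smaller group.
smallest_groups_sketch <- function(proteins, split = ";") {
  groups  <- unique(as.character(proteins))
  members <- strsplit(groups, split, fixed = TRUE)
  sizes   <- lengths(members)
  keep <- character(0)
  seen <- character(0)
  for (k in sort(unique(sizes))) {
    idx <- which(sizes == k)
    # Groups of size k that share no protein with smaller retained groups
    ok <- idx[!vapply(members[idx], function(m) any(m %in% seen), logical(1))]
    keep <- c(keep, groups[ok])
    seen <- c(seen, unlist(members[ok]))
  }
  keep
}

# Made-up protein group annotations, one per peptide
proteins <- c("P1", "P1;P2", "P3;P4", "P3;P4;P5", "P1", "P6;P7;P8")
kept <- smallest_groups_sketch(proteins)
print(kept)

# Peptides whose group is retained
print(proteins %in% kept)
```

In this toy input "P1;P2" is dropped because P1 already forms a group on its own, and "P3;P4;P5" is dropped because of the smaller group "P3;P4".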

3.2.2 Remove reverse sequences (decoys) and contaminants

We now remove the contaminants and the peptides that map to decoy (reverse) sequences. Proteins that were only identified by peptides with modifications are handled in the next section.

pe[["peptideLog"]] <- pe[["peptideLog"]][rowData(pe[["peptideLog"]])$Reverse != "+", ]
pe[["peptideLog"]] <- pe[["peptideLog"]][rowData(pe[["peptideLog"]])$Potential.contaminant != "+", ]

3.2.3 Remove peptides of proteins that were only identified with modified peptides

We skip this step for the moment, because it requires the large protein groups file.

3.2.4 Drop peptides that were only identified in one sample

We want to keep peptides that were observed at least twice.

pe[["peptideLog"]] <- pe[["peptideLog"]][rowData(pe[["peptideLog"]])$nNonZero >= 2, ]
nrow(pe[["peptideLog"]])
## [1] 7011

We keep 7011 peptides upon filtering.

3.3 Quantile normalize the data

pe <- normalize(pe,i="peptideLog",method="quantiles",name="peptideNorm")
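The principle behind quantile normalization can be sketched in a few lines of base R. The function `quantile_normalize_sketch` is a hypothetical helper for complete data only. It is not what normalize() calls internally, and it does not cover the missing values that are present in the real peptide data.

```r
# Hypothetical helper: quantile normalization of a complete numeric matrix
quantile_normalize_sketch <- function(x) {
  stopifnot(is.matrix(x), !anyNA(x))
  # Reference distribution: mean of the k-th smallest value over all samples
  ref <- rowMeans(apply(x, 2, sort))
  # Replace every value by the reference value that has the same rank
  out <- apply(x, 2, function(v) ref[rank(v, ties.method = "first")])
  dimnames(out) <- dimnames(x)
  out
}

# Made-up log intensities: rows are peptides, columns are samples
x <- cbind(s1 = c(5, 2, 3, 4),
           s2 = c(4, 1, 3, 2),
           s3 = c(3, 4, 6, 8))
rownames(x) <- paste0("pep", 1:4)

qn <- quantile_normalize_sketch(x)
print(qn)

# After normalization every sample has the same sorted values
print(apply(qn, 2, sort))
```

Because every column ends up with the same set of values, the density curves of the samples coincide after this step.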

3.4 Explore quantile normalized data

Upon normalisation the density curves for all samples coincide.

limma::plotDensities(assay(pe[["peptideNorm"]]))

We can visualize our data using a multidimensional scaling (MDS) plot, e.g. as provided by the limma package.

limma::plotMDS(assay(pe[["peptideNorm"]]), col = as.numeric(colData(pe)$condition))

The first axis in the plot shows the leading log fold changes (differences on the log scale) between the samples. We notice that the leading differences (log FC) in the peptide data seem to be driven by technical variability. Indeed, the samples are not clearly separated according to the spike-in condition.

3.5 Summarization to protein level

We use the standard summarisation in aggregateFeatures (robust summarisation).

pe <- aggregateFeatures(pe,i="peptideNorm", fcol = "Proteins", na.rm = TRUE, name="protein")
## Your quantitative and row data contain missing values. Please read
## the relevant section(s) in the aggregateFeatures manual page
## regarding the effects of missing values on data aggregation.
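Why a summarization that models peptide effects can help is easiest to see on a toy example with missing values. The sketch below uses made-up, noise-free numbers and swaps in ordinary least squares (lm) for the robust fit, so it illustrates the principle only and is not the aggregateFeatures method.

```r
# One protein, three peptides, two samples (made-up log2 intensities).
# The true sample difference is 1; peptide p2 ionizes well (+2), p3 poorly (-2).
# Peptide p2 is missing in sample s2.
toy <- data.frame(
  sample  = factor(c("s1", "s1", "s1", "s2", "s2")),
  peptide = factor(c("p1", "p2", "p3", "p1", "p3")),
  y       = c(10, 12, 8, 11, 9)
)

# Naive summary: mean of the observed peptides per sample
naive <- tapply(toy$y, toy$sample, mean)
naive_diff <- unname(naive["s2"] - naive["s1"])

# Model-based summary: sample effects adjusted for peptide effects
fit <- lm(y ~ 0 + sample + peptide, data = toy)
model_diff <- unname(coef(fit)["samples2"] - coef(fit)["samples1"])

print(naive_diff)
print(model_diff)
```

The naive means hide the difference between the samples because a different set of peptides is observed in each of them, whereas the model-based estimate recovers it.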

We notice that the leading differences (log FC) in the protein data are still driven by technical variation. On the second dimension, however, we also observe a clear separation according to the spike-in condition. Hence, the summarization that accounts for peptide-specific effects makes the effects due to the spike-in condition more prominent!

plotMDS(assay(pe[["protein"]]),col = as.numeric(colData(pe)$condition))

4 Data Analysis

4.1 Estimation

We model the protein level expression values using msqrob. By default msqrob2 estimates the model parameters using robust regression.

pe <- msqrob(object=pe,i="protein", formula=~condition)

4.2 Inference

What are the parameter names of the model?

getCoef(rowData(pe[["protein"]])$msqrobModels[[1]])
## (Intercept)  conditionB 
##   15.060948    1.564747

Spike-in condition A is the reference class. So the mean log2 expression for samples from condition A is ‘(Intercept)’. The mean log2 expression for samples from condition B is ‘(Intercept)+conditionB’. Hence, the average log2 fold change between condition B and condition A is modelled by the parameter ‘conditionB’. We therefore assess the contrast ‘conditionB=0’ with our statistical test.

L <- makeContrast("conditionB=0",parameterNames=c("conditionB"))
pe <- hypothesisTest(object=pe,i="protein",contrast=L)
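The interpretation of the two parameters can be checked on a toy data set with an ordinary linear model. The values below are made up, and lm is used here instead of the robust fit of msqrob, so this is only a sketch of the parameterization.

```r
# Made-up log2 protein intensities for three replicates per condition
toy <- data.frame(
  condition = factor(c("A", "A", "A", "B", "B", "B")),
  y         = c(15.0, 15.1, 14.9, 16.5, 16.6, 16.7)
)

# Default treatment coding: the first factor level (A) is the reference
fit <- lm(y ~ condition, data = toy)
print(coef(fit))

# Compare with the group means
group_means <- tapply(toy$y, toy$condition, mean)
print(group_means)
print(unname(group_means["B"] - group_means["A"]))
```

The intercept equals the mean of condition A, and the parameter conditionB equals the difference between the means of B and A, which is the log2 fold change that the contrast tests.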

4.3 Plots

4.3.1 Volcano-plot

volcano <- ggplot(rowData(pe[["protein"]])$conditionB, aes(x = logFC, y = -log10(pval),
                                                    color = adjPval < 0.05)) +
  geom_point(cex = 2.5) + scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal()
volcano
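The colouring in the volcano plot depends on p-values that were adjusted for multiple testing. The sketch below assumes that the adjustment is the Benjamini-Hochberg procedure, which is an assumption to verify in the hypothesisTest documentation, and applies it to a few made-up p-values with base R.

```r
# Made-up raw p-values for four proteins
pval <- c(protA = 0.001, protB = 0.01, protC = 0.02, protD = 0.5)

# Benjamini-Hochberg adjustment (assumed here, see the lead-in)
adj_pval <- p.adjust(pval, method = "BH")
print(adj_pval)

# Proteins that would be flagged at the 5% FDR level
significant <- names(adj_pval)[adj_pval < 0.05]
print(significant)
```

An adjusted p-value is never smaller than the raw p-value, so fewer proteins are flagged than when thresholding the raw p-values at 0.05.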

4.3.2 Heatmap

We first select the names of the significant proteins.

sigNames <- rowData(pe[["protein"]])$conditionB %>% rownames_to_column("protein") %>% filter(adjPval<0.05) %>% pull(protein)
heatmap(assay(pe[["protein"]])[sigNames, ])

5 Detail plots

For each significant protein, we extract the normalized peptide and the summarized protein expression values and plot them.

for (protName in sigNames)
{
  # Subset the peptide-level and protein-level assays for this protein
  pePlot <- pe[protName, , c("peptideNorm", "protein")]
  pePlotDf <- data.frame(longFormat(pePlot))
  pePlotDf$assay <- factor(pePlotDf$assay,
                           levels = c("peptideNorm", "protein"))
  pePlotDf$condition <- as.factor(colData(pePlot)[pePlotDf$colname, "condition"])

  # Line plot: one line per peptide (or protein summary) across samples
  p1 <- ggplot(data = pePlotDf,
               aes(x = colname,
                   y = value,
                   group = rowname)) +
    geom_line() + geom_point() + theme_minimal() +
    facet_grid(~ assay) + ggtitle(protName)
  print(p1)

  # Boxplot per sample, faceted by assay (the original lacked the "+" before facet_grid)
  p2 <- ggplot(pePlotDf, aes(x = colname, y = value, fill = condition)) +
    geom_boxplot(outlier.shape = NA) +
    geom_point(position = position_jitter(width = .1), aes(shape = rowname)) +
    scale_shape_manual(values = 1:nrow(pePlotDf)) +
    labs(title = protName, x = "sample", y = "peptide intensity (log2)") +
    theme_minimal() +
    facet_grid(~ assay)
  print(p2)
}